TOC

  1. Data Preparation
  2. Model Building
  3. Analysis of the Contribution of Each Explanatory Variable
  4. Prediction on Electricity Consumption

1. Data Preparation

1.1 Introduction to the Dataset

EnerNOC GreenButton Data is a subset of the Open EnerNOC data repository. The raw data was provided by the EnerNOC electricity supplier and contains anonymous 5-minute electricity consumption data for 100 commercial/industrial sites for the year 2012. The simplified version contains data at 30-minute intervals.

The explanatory variables are:

  • value: Electricity consumption by timestamp
  • week: In which week was the data collected
  • date: On which day was the data collected
  • type: Type of the building

1.2 Overview of the Dataset

Load Necessary Packages

library(feather)
library(data.table)
library(mgcv)
## Loading required package: nlme
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
library(car)
## Loading required package: carData
library(ggplot2)
library(dplyr, warn.conflicts = FALSE)

Read the Data and show the overview of the data.

DT <- as.data.table(read_feather("DT_4_ind"))
str(DT)
## Classes 'data.table' and 'data.frame':   70080 obs. of  5 variables:
##  $ date_time: POSIXct, format: "2012-01-02 00:00:00" "2012-01-02 00:30:00" ...
##  $ value    : num  1590 1564 1560 1585 1604 ...
##  $ week     : chr  "Monday" "Monday" "Monday" "Monday" ...
##  $ date     : Date, format: "2012-01-02" "2012-01-02" ...
##  $ type     : chr  "Commercial Property" "Commercial Property" "Commercial Property" "Commercial Property" ...
##  - attr(*, ".internal.selfref")=<externalptr>

Plot the Data

ggplot(data = DT, aes(x = date, y = value)) +
  geom_line() + 
  facet_grid(type ~ ., scales = "free_y") +
  theme(panel.border = element_blank(),
        panel.background = element_blank(),
        panel.grid.minor = element_line(colour = "grey90"),
        panel.grid.major = element_line(colour = "grey90"),
        panel.grid.major.x = element_line(colour = "grey90"),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12, face = "bold"),
        strip.text = element_text(size = 9, face = "bold")) +
  labs(x = "Date", y = "Load (kW)")

We can see that the electricity consumption of the category Food Sales & Storage is not affected by weekdays or weekends.

1.3 Processing of Information about Days and Weeks

We use the car::record() function to easily describe the relationship between the electricity consumption and each day of the week. We do this by appending a new column that associates each day of the week with a unique number

DT[, week_num := as.integer(car::recode(week,
    "'Monday'='1';'Tuesday'='2';'Wednesday'='3';'Thursday'='4';
    'Friday'='5';'Saturday'='6';'Sunday'='7'"))]
unique(DT[, week])
## [1] "Monday"    "Tuesday"   "Wednesday" "Thursday"  "Friday"    "Saturday" 
## [7] "Sunday"
unique(DT[, week_num])
## [1] 1 2 3 4 5 6 7

We extract information related to “industry”, “data”, “week” and “period” from the dataset. Since the data was collected every half an hour, there’re 48 consecutive observations within a day, thus we have period <- 48.

n_type <- unique(DT[, type])
n_date <- unique(DT[, date])
n_weekdays <- unique(DT[, week])
period <- 48

We select the electricity consumption of a commercial building, store it as the variable data_r and plot the data.

type == n_type[1] stands for “Commercial Property”, whereas date %in% n_date[57:70] corresponds to two weeks.

data_r <- DT[(type == n_type[1] & date %in% n_date[57:70])]

ggplot(data_r, aes(date_time, value)) +
  geom_line() +
  theme(panel.border = element_blank(),
        panel.background = element_blank(),
        panel.grid.minor = element_line(colour = "grey90"),
        panel.grid.major = element_line(colour = "grey90"),
        panel.grid.major.x = element_line(colour = "grey90"),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12, face = "bold")) +
  labs(x = "Date", y = "Load (kW)")

We re-organize the data in accordance with the change of days and weeks.

N <- nrow(data_r) # number of rows in the training set
window <- N / period # number of days in the training set
matrix_gam <- data.table(Load = data_r[, value],
                         Daily = rep(1:period, window),
                         Weekly = data_r[, week_num])
head(matrix_gam)
##        Load Daily Weekly
##       <num> <int>  <int>
## 1: 1630.875     1      1
## 2: 1611.201     2      1
## 3: 1657.160     3      1
## 4: 1653.042     4      1
## 5: 1702.316     5      1
## 6: 1648.411     6      1

2. Model Building

2.1 The First Model

We use the mgcv:gam() function to build the GAM model. The periodic change in days is described by a “cubic regression spline”, whereas the periodic change in weeks is described by “P-splines”.

gam_1 <- gam(Load ~ s(Daily, bs = "cr", k = period) +
               s(Weekly, bs = "ps", k = 7),
             data = matrix_gam,
             family = gaussian)

Inspect the summary of the model.

summary(gam_1)$r.sq
## [1] 0.7718406
summary(gam_1)$sp.criterion
##   GCV.Cp 
## 245544.9

GCV is an indicator of the fit of the model. The lower this value is, the fitter the model is. In addition we can see that R-sq is not high, which indicates the bad performance of the model.

Compare the difference between the prediction and the reality over those two weeks.

matrix_gam$Predict=gam_1$fitted.values
ggplot(matrix_gam[1:nrow(matrix_gam),], aes(1:nrow(matrix_gam)))+
           labs("lab")+
           geom_line(aes(y=Load, color="Real"), size = 0.8)+
           geom_line(aes(y = Predict, color = "Predict"), size = 0.8)

It doesn’t look nice. We carefully inspect the electricity consumption during the first week.

row_Mon <- nrow(matrix_gam)/2
matrix_gam$Predict=gam_1$fitted.values
ggplot(matrix_gam[1:row_Mon,], aes(1:row_Mon))+
           labs("lab")+
           geom_line(aes(y=Load, color="Real"), size = 0.8)+
           geom_line(aes(y = Predict, color = "Predict"), size = 0.8)

This model can only predict the trend of the weekdays’ electricity consumption, but fails in predicting the exact amount of electricity consumption. We do a detailed inspection on the electricity consumption on Monday.

row_Mon <- nrow(matrix_gam)/14
matrix_gam$Predict=gam_1$fitted.values
ggplot(matrix_gam[1:row_Mon,], aes(1:row_Mon))+
           labs("lab")+
           geom_line(aes(y=Load, color="Real"), size = 0.8)+
           geom_line(aes(y = Predict, color = "Predict"), size = 0.8)

The problem is that the real electricity consumption at the end of the day is not at the same level as it at the beginning of the day, but the model shows a pure periodic change within the day, which is different from the reality. Thus we need to change our mindset and build another model.

2.2 The Second Model

This time we use the method of interaction between different scale and build a model by combining Daily and Weekly.

gam_2 <- gam(Load ~ s(Daily, Weekly),
             data = matrix_gam,
             family = gaussian)
 
summary(gam_2)$r.sq
## [1] 0.9352108
summary(gam_2)$sp.criterion
##   GCV.Cp 
## 71162.37

According to R.sq and p-value, we can say that this model performs better.

Plot the difference between predicted result and the real result.

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_2$fitted.values,
                                   data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.2")

We can see that the fit during Monday to Thursday significantly improved.

2.3 The Third Model

Next, we use another advanced method of interaction and use a smooth function called “tensor product”.

gam_3 <- gam(Load ~ te(Daily, Weekly,
                       bs = c("cr", "ps")),
             data = matrix_gam,
             family = gaussian)
 
summary(gam_3)$r.sq
## [1] 0.9268452
summary(gam_3)$sp.criterion
##  GCV.Cp 
## 79724.9

Plot gam_3

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_3$fitted.values,
                                   data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.3")

2.4 The Fourth Model

We can make it even better. For example let the knots (a concept that is similar to dimension) fit the periodic change of days and weeks better.

gam_4 <- gam(Load ~ te(Daily, Weekly,
                        k = c(period, 7),
                        bs = c("cr", "ps")),
              data = matrix_gam,
              family = gaussian)
 
summary(gam_4)$r.sq
## [1] 0.9727604
summary(gam_4)$sp.criterion
##   GCV.Cp 
## 34839.46

We can see that R-sq has increased a little bit. But the most significant is edf, which has increased 5 times.

Plot gam_4

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_4$fitted.values,
                                   data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.4")

2.5 The Fifth Model

All right, what about be greedier and combine all the previous models? Let’s examine our thought by building gam_5

gam_5 <- gam(Load ~ s(Daily, bs = "cr", k = period) +
                    s(Weekly, bs = "ps", k = 7) +
                    ti(Daily, Weekly,
                       k = c(period, 7),
                       bs = c("cr", "ps")),
             data = matrix_gam,
             family = gaussian)
 
summary(gam_5)$r.sq
## [1] 0.9717469
summary(gam_5)$sp.criterion
##   GCV.Cp 
## 35772.35

Although p-value remains \(0\), but R-sq decreased and GCV increased. This means that the performance is not as good as the previous gam_4 model.

Plot gam_5

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_5$fitted.values,
                                   data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.5")

2.6 The Sixth Model

Now is our last attempt. Here we add another method of tensor product interations and introduce a stricter condition by setting full = TRUE.

gam_6 <- gam(Load ~ t2(Daily, Weekly,
                       k = c(period, 7),
                       bs = c("cr", "ps"),
                       full = TRUE),
             data = matrix_gam,
             family = gaussian)
 
summary(gam_6)$r.sq
## [1] 0.9738273
summary(gam_6)$sp.criterion
##   GCV.Cp 
## 32230.68

Plot gam_6

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_6$fitted.values,
                                   data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.6")

This plot looks even better.

2.7 Comparison of the Models

With so many models, how to decide which one is the best? Just ask the omnipotent AIC.

AIC(gam_1, gam_2, gam_3, gam_4, gam_5, gam_6)
##              df       AIC
## gam_1  17.46996 10248.993
## gam_2  30.70080  9415.768
## gam_3  25.65709  9492.545
## gam_4 121.41166  8912.611
## gam_5 115.80849  8932.746
## gam_6 100.12005  8868.628

Apparently gam_4, gam_5, gam_6 are on the leading board. gam_6 has the best performance, while gam_4 comes in second.

Next we plot gam_4, gam_6 together and see what we found.

layout(matrix(1:2, nrow = 1))
plot(gam_4, rug = FALSE, se = FALSE, n2 = 80, main = "gam n.4 with te()")
plot(gam_6, rug = FALSE, se = FALSE, n2 = 80, main = "gam n.6 with t2()")

These contour lines indicate each model’s response on Weekly and Daily. Although they look similar, the contour of gam_6 has more wave-like patterns. This is an indication of its higher sensitivity.

Visualization of the Best Performing Model

Before the end of this section, let’s see what we can do to make the plot of gam_6 looks better. Firstly we use the vis.gam function from the package mgcv.

# vis.gam(gam_6, main = "t2(D, W)", plot.type = "contour",
#         color = "terrain", contour.col = "black", lwd = 2)
vis.gam(gam_6, main = "t2(D, W)", 
        color = "terrain", contour.col = "black", lwd = 2)

We can see that the electricity consumption on weekdays are higher than on weekends. The peak hours are arround 3 pm from Monday to Thursday.

Without using the contour.col option, we can make a 3D plot.

vis.gam(gam_6, n.grid = 50, theta = 35, phi = 32, zlab = "",
        ticktype = "detailed", color = "topo", main = "t2(D, W)")

Change the viewing angle

vis.gam(gam_6, n.grid = 50, theta = 190, phi = 20, zlab = "",
        ticktype = "detailed", color = "topo", main = "t2(D, W)")

3. Analysis of the Contribution of Each Explanatory Variable

3.1 Models Building

Now let’s see what would happen if we discard explanatory variables one by one.

gam_6D <- gam(Load ~ t2(Daily, 
                       k = period,
                       bs = "cr",
                       full = TRUE),
             data = matrix_gam,
             family = gaussian)
 
summary(gam_6D)$r.sq
## [1] 0.5129567
summary(gam_6D)$sp.criterion
##   GCV.Cp 
## 518169.2

Em, it doesn’t look good.

Now let’s discard variable Daily

gam_6W <- gam(Load ~ t2(Weekly,
                       k = 7,
                       bs =  "ps",
                       full = TRUE),
             data = matrix_gam,
             family = gaussian)
 
summary(gam_6W)$r.sq
## [1] 0.2502016
summary(gam_6W)$sp.criterion
##   GCV.Cp 
## 791454.6

We can see it looks much worse.

Let’s maintain a rigorous attitude and use ANOVA to compare the difference between these three leading models. Firstly we discard variable Weekly and see what will happen

anova(gam_6, gam_6D, test="F")
## Analysis of Deviance Table
## 
## Model 1: Load ~ t2(Daily, Weekly, k = c(period, 7), bs = c("cr", "ps"), 
##     full = TRUE)
## Model 2: Load ~ t2(Daily, k = period, bs = "cr", full = TRUE)
##   Resid. Df Resid. Dev      Df   Deviance      F    Pr(>F)    
## 1    550.77   15740821                                        
## 2    661.13  339050831 -110.37 -323310010 106.61 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Then we discard variable Daily and see what will happen

anova(gam_6, gam_6W, test="F")
## Analysis of Deviance Table
## 
## Model 1: Load ~ t2(Daily, Weekly, k = c(period, 7), bs = c("cr", "ps"), 
##     full = TRUE)
## Model 2: Load ~ t2(Weekly, k = 7, bs = "ps", full = TRUE)
##   Resid. Df Resid. Dev      Df   Deviance     F    Pr(>F)    
## 1    550.77   15740821                                       
## 2    667.88  526095056 -117.11 -510354235 158.6 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The result is clear – non of the variable Weekly and variable Daily can be dropped!

3.2 Model Plotting

Plot the result of discarding variable Weekly.

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_6D$fitted.values,
                                   data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.6")

We can see that there is no weekly difference in the electricity consumption when the variable Weekly is dropped.

Plot the result of discarding variable Daily.

datas <- rbindlist(list(data_r[, .(value, date_time)],
                        data.table(value = gam_6W$fitted.values,
                                   data_time = data_r[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datas[, type := c(rep("Real", nrow(data_r)), rep("Fitted", nrow(data_r)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Fit from GAM n.6")

We can see that there is no difference in the electricity consumption over the 24 hours of a day when the variable Daily is dropped.

4. Prediction on Electricity Consumption

Lastly, the most exciting part – let’s predict the electricity consumption for the next two weeks.

data_test <- DT[(type == n_type[1] & date %in% n_date[71:84])]
matrix_test <- data.table(Load = data_test[, value],
                           Daily = rep(1:period, window),
                           Weekly = data_test[, week_num])
pred_week <- predict(gam_6, matrix_test[1:(7*period)],interval="confidence", level = 0.95)

datat <- rbindlist(list(data_test[, .(value, date_time)],
                        data.table(value = pred_week,
                                   data_time = data_test[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datat[, type := c(rep("Real", nrow(data_test)), rep("Predicted", nrow(data_test)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Predicted result on GAM n.6")

Predict the electricity consumption of the next month.

data_test <- DT[(type == n_type[1] & date %in% n_date[71:98])]
matrix_test <- data.table(Load = data_test[, value],
                           Daily = rep(1:period, window),
                           Weekly = data_test[, week_num])
pred_week <- predict(gam_6, matrix_test[1:(7*period)],interval="confidence", level = 0.95)

datat <- rbindlist(list(data_test[, .(value, date_time)],
                        data.table(value = pred_week,
                                   data_time = data_test[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datat[, type := c(rep("Real", nrow(data_test)), rep("Predicted", nrow(data_test)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Predicted result on GAM n.6")

5 Additional Information

Additional content that was added at the end of the semester.

Function Definition

Define the MAPE function.

mape <- function(real, pred){
  return(100 * mean(abs((real - pred)/real)))
}

Define the criteria for model evaluation (R-sq, GCV, MAPE).

gam_eval <- function(model){
    return(data.table(RSQ=summary(model)$r.sq, 
                      GCV=summary(model)$sp.criterion, 
                      MAPE=mape(data_new[, value], model$fitted.values)))
}

Define the function for plotting models.

gam_plot <- function(model, title){
    datas <- rbindlist(list(data_new[, .(value, date_time)],
                        data.table(value = model$fitted.values,
                                   data_time = data_new[, date_time])))
datas[, type := c(rep("Real", nrow(data_new)), rep("Fitted", nrow(data_new)))]
 
ggplot(data = datas, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = title)
}

Model Building

Use the first season’s data to train a model.

data_new <- DT[(type == n_type[1] & date %in% n_date[1:91])]

ggplot(data_new, aes(date_time, value)) +
  geom_line() +
  theme(panel.border = element_blank(),
        panel.background = element_blank(),
        panel.grid.minor = element_line(colour = "grey90"),
        panel.grid.major = element_line(colour = "grey90"),
        panel.grid.major.x = element_line(colour = "grey90"),
        axis.text = element_text(size = 10),
        axis.title = element_text(size = 12, face = "bold")) +
  labs(x = "Date", y = "Load (kW)")

Re-organize the data and add information about the electricity consumption of the previous one day and the previous week.

matrix_new <- data.table(Load = data_new[, value],
                         PrevDayLoad = c(data_new[1:48, value], data_new[1:4320, value]),
                         PrevWeekLoad = c(data_new[1:336, value], data_new[1:4032, value]),
                         Daily = rep(1:period, window),
                         Weekly = data_r[, week_num])
## Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
## check.names, : Item 4 has 672 rows but longest item has 4368; recycled with
## remainder.
## Warning in as.data.table.list(x, keep.rownames = keep.rownames, check.names =
## check.names, : Item 5 has 672 rows but longest item has 4368; recycled with
## remainder.

Build the first model.

gam_new_1 <- gam(Load ~ s(Daily, bs = "cr", k = period) +
               s(Weekly, bs = "ps", k = 7),
             data = matrix_new,
             family = gaussian)

Evaluate the first model.

eval_1 <- gam_eval(gam_new_1)
eval_1
##          RSQ      GCV     MAPE
##        <num>    <num>    <num>
## 1: 0.7194149 311287.5 18.89976

Plot the first model.

gam_plot(gam_new_1, "Fit from gam_new_1")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the second model.

gam_new_2 <- gam(Load ~ s(Daily, Weekly),
             data = matrix_new,
             family = gaussian)

Evaluate the second model.

eval_2 <- gam_eval(gam_new_2)
eval_2
##          RSQ      GCV     MAPE
##        <num>    <num>    <num>
## 1: 0.8487935 168039.3 10.62365

Plot the second model.

gam_plot(gam_new_2, "Fit from gam_new_2")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the third model.

gam_new_3 <- gam(Load ~ te(Daily, Weekly,
                       bs = c("cr", "ps")),
             data = matrix_new,
             family = gaussian)

Evaluate the third model.

eval_3 <- gam_eval(gam_new_3)
eval_3
##          RSQ      GCV     MAPE
##        <num>    <num>    <num>
## 1: 0.8423209 175028.3 10.18985

Plot the third model.

gam_plot(gam_new_3, "Fit from gam_new_3")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the fourth model.

gam_new_4 <- gam(Load ~ te(Daily, Weekly,
                        k = c(period, 7),
                        bs = c("cr", "ps")),
              data = matrix_new,
              family = gaussian)

Evaluate the fourth model.

eval_4 <- gam_eval(gam_new_4)
eval_4
##          RSQ      GCV     MAPE
##        <num>    <num>    <num>
## 1: 0.8796766 135799.2 8.319575

Plot the fourth model.

gam_plot(gam_new_4, "Fit from gam_new_4")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the fifth model.

gam_new_5 <- gam(Load ~ s(Daily, bs = "cr", k = period) +
                    s(Weekly, bs = "ps", k = 7) +
                    ti(Daily, Weekly,
                       k = c(period, 7),
                       bs = c("cr", "ps")),
             data = matrix_new,
             family = gaussian)

Evaluate the fifth model.

eval_5 <- gam_eval(gam_new_5)
eval_5
##          RSQ    GCV     MAPE
##        <num>  <num>    <num>
## 1: 0.8801601 135306 8.337419

Plot the fifth model.

gam_plot(gam_new_5, "Fit from gam_new_5")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the sixth model.

gam_new_6 <- gam(Load ~ t2(Daily, Weekly,
                       k = c(period, 7),
                       bs = c("cr", "ps"),
                       full = TRUE),
             data = matrix_new,
             family = gaussian)

Evaluate the sixth model.

eval_6 <- gam_eval(gam_new_6)
eval_6
##          RSQ      GCV    MAPE
##        <num>    <num>   <num>
## 1: 0.8802681 134768.2 8.32655

Plot the sixth model.

gam_plot(gam_new_6, "Fit from gam_new_6")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the seventh model – without using smooth function of GAM.

gam_new_simple <- gam(Load ~ Daily+Weekly,
              data = matrix_new,
              family = gaussian)

Evaluate the seventh model.

eval_7 <- gam_eval(gam_new_simple)
eval_7
##          RSQ      GCV     MAPE
##        <num>    <num>    <num>
## 1: 0.4302044 629325.7 25.61359

Plot the seventh model.

gam_plot(gam_new_simple, "Fit from gam_new_simple")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the eighth model – only apply te smooth function on the variable Daily.

gam_new_te_daily <- gam(Load ~ te(Daily, bs = "cr", k = period) +Weekly,
             data = matrix_new,
             family = gaussian)

Evaluate the eighth model.

eval_8 <- gam_eval(gam_new_te_daily)
eval_8
##          RSQ      GCV     MAPE
##        <num>    <num>    <num>
## 1: 0.6366094 402558.4 20.44199

Plot the eighth model.

gam_plot(gam_new_te_daily, "Fit from gam_new_te_daily")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Build the nineth model – only apply te smooth function on the variable Weekly.

gam_new_te_weekly <- gam(Load ~ te(Weekly, bs = "ps", k = 7) +Daily,
             data = matrix_new,
             family = gaussian)

Evaluate the nineth model.

eval_9 <- gam_eval(gam_new_te_weekly)
eval_9
##          RSQ    GCV   MAPE
##        <num>  <num>  <num>
## 1: 0.5124041 539142 24.279

Plot the nineth model.

gam_plot(gam_new_te_weekly, "Fit from gam_new_te_weekly")
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.

Analysis of the models

Use a table to analyse the nine models.

eval_table <- bind_rows(eval_1, eval_2, eval_3, eval_4, eval_5, eval_6, eval_7, eval_8, eval_9)
all_aic <- AIC(gam_new_1, gam_new_2, gam_new_3, gam_new_4, gam_new_5, gam_new_6, gam_new_simple, gam_new_te_daily, gam_new_te_weekly)$AIC

eval_table[, AIC :=all_aic]
eval_table
##          RSQ      GCV      MAPE      AIC
##        <num>    <num>     <num>    <num>
## 1: 0.7194149 311287.5 18.899761 67646.26
## 2: 0.8487935 168039.3 10.623654 64953.21
## 3: 0.8423209 175028.3 10.189855 65131.27
## 4: 0.8796766 135799.2  8.319575 64020.79
## 5: 0.8801601 135306.0  8.337419 64004.82
## 6: 0.8802681 134768.2  8.326550 63987.99
## 7: 0.4302044 629325.7 25.613586 70721.15
## 8: 0.6366094 402558.4 20.441993 68769.43
## 9: 0.5124041 539142.0 24.278996 70045.54

Predict the electricity consumption of season 2 using the fourth model.

data_test_2qt <- DT[(type == n_type[1] & date %in% n_date[92:183])]
matrix_test_2qt <- data.table(Load = data_test_2qt[, value],
                           Daily = rep(1:period, 91),
                           Weekly = data_test_2qt[, week_num])
pred_2qt <- predict(gam_new_4, matrix_test_2qt[1:(7*period)],interval="confidence", level = 0.95)

datat <- rbindlist(list(data_test_2qt[, .(value, date_time)],
                        data.table(value = pred_2qt,
                                   data_time = data_test_2qt[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datat[, type := c(rep("Real", nrow(data_test_2qt)), rep("Predicted", nrow(data_test_2qt)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Predicted result on GAM n.6")

Predict the electricity consumption of season 3 using the fourth model.

data_test_3qt <- DT[(type == n_type[1] & date %in% n_date[183:274])]
matrix_test_3qt <- data.table(Load = data_test_3qt[, value],
                           Daily = rep(1:period, 91),
                           Weekly = data_test_3qt[, week_num])
pred_3qt <- predict(gam_new_4, matrix_test_3qt[1:(7*period)],interval="confidence", level = 0.95)

datat <- rbindlist(list(data_test_3qt[, .(value, date_time)],
                        data.table(value = pred_3qt,
                                   data_time = data_test_3qt[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datat[, type := c(rep("Real", nrow(data_test_3qt)), rep("Predicted", nrow(data_test_3qt)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Predicted result on GAM n.6")

Predict the electricity consumption of season 4 using the fourth model.

data_test_4qt <- DT[(type == n_type[1] & date %in% n_date[274:365])]
matrix_test_4qt <- data.table(Load = data_test_4qt[, value],
                           Daily = rep(1:period, 91),
                           Weekly = data_test_4qt[, week_num])
pred_4qt <- predict(gam_new_4, matrix_test_4qt[1:(7*period)],interval="confidence", level = 0.95)

datat <- rbindlist(list(data_test_4qt[, .(value, date_time)],
                        data.table(value = pred_4qt,
                                   data_time = data_test_4qt[, date_time])))
## Column 2 ['data_time'] of item 2 is missing in item 1. Use fill=TRUE to fill with NA (NULL for list columns), or use.names=FALSE to ignore column names. use.names='check' (default from v1.12.2) emits this message and proceeds as if use.names=FALSE for  backwards compatibility. See news item 5 in v1.12.2 for options to control this message.
datat[, type := c(rep("Real", nrow(data_test_4qt)), rep("Predicted", nrow(data_test_4qt)))]
ggplot(data = datat, aes(date_time, value, group = type, colour = type)) +
  geom_line(size = 0.8) +
  theme_bw() +
  labs(x = "Time", y = "Load (kW)",
       title = "Predicted result on GAM n.6")

Compute the MAPE of the predictions.

mape_2qt <- mape(matrix_test_2qt[1:(7*period)]$Load, pred_2qt)
mape_3qt <- mape(matrix_test_3qt[1:(7*period)]$Load, pred_3qt)
mape_4qt <- mape(matrix_test_4qt[1:(7*period)]$Load, pred_4qt)
mapes <- cbind(mape_2qt, mape_3qt, mape_4qt)
mapes
##      mape_2qt mape_3qt mape_4qt
## [1,] 12.01464 14.89136 12.93795